load_datasets_in_python

preimport.py

import pandas as pd
import sklearn
import statsmodels

import matplotlib.pyplot as plt
%matplotlib inline

data sets

  • statsmodelsの関数からRのデータセットを用いる方法
    • http://vincentarelbundock.github.io/Rdatasets/datasets.html
  • scikit learnで提供されているデータセット
    • 見やすいようにPandasで加工
    • http://scikit-learn.org/stable/datasets/#toy-datasets

dataset_statsmodels.py

import statsmodels.api as sm
duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
duncan_prestige.data.head()

    type    income  education   prestige
accountant  prof    62  86  82
pilot   prof    72  76  83
architect   prof    75  92  90
author  prof    55  90  76
chemist prof    64  86  90

dataset_sklearn.py

# https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/base.py
from sklearn import datasets
iris = datasets.load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_data["target"] = pd.Series(iris.target)
iris_data["target"] = iris_data["target"].apply(lambda x: iris.target_names[x])
iris_data.head()

    sepal length (cm)   sepal width (cm)    petal length (cm)   petal width (cm)    target
0   5.1 3.5 1.4 0.2 setosa
1   4.9 3.0 1.4 0.2 setosa
2   4.7 3.2 1.3 0.2 setosa
3   4.6 3.1 1.5 0.2 setosa
4   5.0 3.6 1.4 0.2 setosa